Wastewater-based Epidemiology for COVID-19 Surveillance: A Survey

The COVID-19 pandemic has imposed tremendous pressure on public health systems and socioeconomic ecosystems over the past years. To alleviate its social impact, it is important to proactively track the prevalence of COVID-19 within communities. The traditional way to estimate disease prevalence is from reported clinical test data or surveys. However, the coverage of clinical tests is often limited, and the tests can be labor-intensive and require reliable and timely results as well as consistent diagnostic and reporting criteria. Recent studies revealed that patients diagnosed with COVID-19 often shed SARS-CoV-2 virus in feces that enter wastewater, which makes wastewater-based epidemiology (WBE) for COVID-19 surveillance a promising approach to complement traditional clinical testing. In this paper, we survey the existing literature regarding WBE for COVID-19 surveillance and summarize the current advances in the area. Specifically, we cover the key aspects of wastewater sampling and sample testing, and present a comprehensive and organized summary of wastewater data analytical methods. Finally, we discuss the open challenges in current wastewater-based COVID-19 surveillance studies, aiming to encourage new ideas that advance the development of effective wastewater-based surveillance systems for general infectious diseases.


Introduction
The COVID-19 pandemic has posed significant challenges to public health systems and the global economy, creating an urgent need for effective surveillance methods to monitor the prevalence of the disease within communities. Conventional surveillance methods depend heavily on clinical test data, such as positive test cases and hospitalizations. The inherent limitations of clinical data-based surveillance lie in its limited coverage, labor intensity, and data staleness due to prolonged test procedures. To estimate the prevalence of the disease and detect potential outbreaks in a more timely fashion, wastewater-based epidemiology (WBE) surveillance has been identified as a complement to clinical methods.
WBE has been successfully used for monitoring pharmaceutical use Bischel et al. (2015), illicit drug use Zuccato et al. (2008), flu prevalence Heijnen and Medema (2011), and polio outbreaks Brouwer et al. (2018). Recent research suggests that monitoring the SARS-CoV-2 level in wastewater can be a reliable way to understand disease prevalence in addition to clinical test results Safford et al. (2022). Specifically, wastewater samples can be collected from manholes in the targeted communities or from the wastewater treatment plants (WWTPs) in the sewersheds. The collected samples are then tested to quantify the concentration and the total load of the SARS-CoV-2 virus. The resulting viral concentration/load can be viewed as a comprehensive snapshot of disease prevalence within the community. By collectively analyzing the viral data from multiple timestamps, the trajectory of the disease may be estimated, which can be further used for trend projection. Figure 1 shows an overview of the wastewater-based epidemic surveillance system.
While a promising tool, wastewater-based COVID-19 surveillance is subject to some key limitations and challenges. The first challenge is the variability in viral shedding rates. Specifically, individuals of different symptom severity and age groups may contribute virus to the sewage system at significantly different rates, which makes it hard to approximate the infected population from the wastewater viral load. Second, the wastewater viral load may be underestimated due to dilution in the sewer system, in-sewer transportation loss, degradation of the virus, and the test procedures used. Such loss is inevitable and could lead to missed cases or delayed outbreak alerts. In addition, the sewershed population, wastewater flow variations, and sampling methods may affect how well the viral level in the test sample represents the disease prevalence of the entire community. Therefore, approximating the actual viral load that flows into the sewage system from degraded signals requires careful modeling and analysis. The last challenge is the integration of wastewater analysis with conventional surveillance results (e.g., reported cases, hospitalizations). Wastewater-based surveillance data provide a comprehensive snapshot of disease prevalence within the whole community, but with potentially considerable degradation. In contrast, conventional surveillance results are accurate but only cover a limited portion of the infected population. Effectively combining the two data sources can be problematic because the studied populations are not well aligned.
In this paper, we survey the current literature that encompasses critical facets of wastewater-based surveillance for COVID-19, including wastewater sampling techniques, sample testing methodologies, data analysis methods, and available datasets at the global level. Furthermore, we highlight the ongoing challenges in wastewater-based COVID-19 surveillance systems and hope to inspire continued innovation and development in the domain. It is worth mentioning that the data analytic methods for COVID-19 can be easily generalized to the surveillance tasks for other infectious diseases summarized in Kilaru et al. (2023).

Differences with Existing Surveys. Existing surveys on wastewater-based COVID-19 surveillance are predominantly focused on sampling methods, virus detection and quantification, and surveillance system design Sharara et al. (2021); Polo et al. (2020); Shah et al. (2022); Hamouda et al. (2021). The two surveys in Ciannella et al. (2023); Li et al. (2023c) have covered the correlation analysis between viral concentration and clinical test results, but they are not comprehensive enough to cover all the critical aspects of the analysis (e.g., sample type, sample frequency, correlation metrics). To the best of our knowledge, this is the first survey that focuses on summarizing the state-of-the-art analytical methods used in wastewater-based COVID-19 surveillance.

Survey Structure. The remainder of this survey is organized as follows: Section 2 and Section 3 briefly introduce the current advances in wastewater sampling and sample testing. Section 4 covers different aspects of wastewater analytic methods. Section 5 provides a comprehensive list of wastewater datasets for SARS-CoV-2 surveillance. Section 6 discusses the current limitations and challenges of wastewater-based COVID-19 surveillance systems, and Section 7 concludes the survey.

Wastewater Sampling
Sampling is a critical step for wastewater-based COVID-19 surveillance, as it defines the surveillance scope for the disease. In particular, sampling within the sewer network can monitor the viral level at the community or building level, while sampling at the wastewater treatment plant can estimate the infection level at the sewershed level. In addition to the sample location, the sample frequency, sample type, and sample method may also affect the effectiveness of disease surveillance and prevalence estimation. This section summarizes the key findings for these three aspects of wastewater sampling.

Sample Frequency. WBE is an important tool for monitoring the prevalence of SARS-CoV-2 in the community. Depending on the goal of surveillance, the sampling frequency can vary. To screen for the presence of the virus, sampling once per week may be sufficient. To identify infection trends, at least three sampling points within a trend period of interest are needed. The National Wastewater Surveillance System (NWSS) suggests using a 15-day surveillance window for trend reporting CDC.

Sample Type. Wastewater samples can be categorized into two different types: (1) untreated wastewater from upstream sewage networks like manholes or treatment plant influent, and (2) treated wastewater from primary sludge in the treatment plant after the first solids removal stage. The advantage of using untreated wastewater from the upstream network or influent is that it can reflect fine-grained viral levels in targeted communities Layton et al. (2022); Cohen et al. (2022); Rondeau et al. (2023). However, most untreated wastewater samples need to be concentrated prior to viral extraction. For treated wastewater samples from primary sludge, the concentration step can be eliminated, but the viral level in the sample can only be used to evaluate the disease prevalence in the entire sewershed.

Sample Method. There are two commonly used methods to collect wastewater samples: grab and composite. The grab method collects a fixed amount of wastewater at a certain time, while the composite method pools multiple grab samples collected over a certain period of time. A study in Nevada showed that the SARS-CoV-2 concentration in composite samples was 10× higher than in early-morning grab samples. In Augusto et al. (2022), a similar study was conducted to evaluate the variability of SARS-CoV-2 RNA concentration in grab and composite samples from both wastewater treatment plants and sewer manholes in Brazil. Their study showed no significant difference between the viral concentrations of the grab and composite samples. In particular, the concentrations of composite samples showed greater agreement with the concentrations of grab samples collected between 8 a.m. and 10 a.m. The low variability between the two types of samples was also observed in a study at a wastewater treatment plant in Norfolk, Virginia Curtis et al. (2020). However, the variability may be amplified when calculating the daily viral load (viral load = viral concentration × daily influent flow) from the viral concentrations. Such findings suggest that grab samples are sufficient to characterize SARS-CoV-2 concentrations, whereas composite samples should be used to effectively quantify the total viral load in wastewater.
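The viral load relation quoted above (viral load = viral concentration × daily influent flow) can be sketched in a few lines. The function name, units, and numerical values below are illustrative assumptions, not values from the cited studies; the sketch only shows how the same measured concentration translates into different daily loads when the influent flow changes.

```python
def daily_viral_load(concentration_gc_per_l, influent_flow_l_per_day):
    """Daily viral load (gene copies/day) = concentration (gc/L) x daily influent flow (L/day)."""
    if concentration_gc_per_l < 0 or influent_flow_l_per_day < 0:
        raise ValueError("concentration and flow must be non-negative")
    return concentration_gc_per_l * influent_flow_l_per_day


# Hypothetical values: the same concentration measured on a low-flow and a high-flow day.
concentration = 2.0e4                # gc/L (illustrative)
low_flow, high_flow = 4.0e7, 6.0e7   # L/day (illustrative)

load_low = daily_viral_load(concentration, low_flow)
load_high = daily_viral_load(concentration, high_flow)
print(f"low-flow load:  {load_low:.2e} gc/day")
print(f"high-flow load: {load_high:.2e} gc/day")
```

Because the load scales linearly with flow, any error in either the concentration or the flow measurement propagates directly into the load estimate.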

Sample Testing
Sample testing aims to estimate the viral concentration from the wastewater samples, which directly affects the usefulness of downstream data analytic models. Generally, the testing step includes sample pre-processing and virus detection/quantification. To account for the viral loss in the testing step, lab control methods have been introduced to the process. Recent studies suggest that the tested viral concentration should also be normalized by the population served by the sewer system. Correspondingly, different normalization methods have been incorporated into the virus quantification model. In this section, we summarize the key advances in sample pre-processing, virus detection and quantification, lab control methods, and normalization methods.

Sample Pre-processing. The wastewater samples need to be properly processed before being tested. The purpose of sample pre-processing is to remove solids Jmii et al. (2021) and inactivate viruses/bacteria Reynolds et al. (2022). To remove the solids from the sample, centrifugation and filtration can be performed. Specifically, the filtration needs to be done with large pore sizes (5 µm or larger) per CDC's guidance CDC. In Yanaç et al. (2022), the authors suggested that SARS-CoV-2 RNA might predominate in solids. Therefore, concentration methods focusing on both supernatant and solid fractions may perform better for virus recovery. For viral inactivation, effective procedures include thermal treatment Calderón-Franco et al. (2022); McMinn et al. (2021), UV light Castiglioni et al. (2022); Pellegrinelli et al. (2022), and chemical treatment Tomasino et al. (2021). Another key step before sample testing is sample concentration, which can help with the detection of SARS-CoV-2 RNA. As mentioned in the previous section, the concentration step is particularly helpful for untreated wastewater samples as compared to treated samples. Effective concentration approaches include ultrafiltration Dumke et al. (2021); Hasing et al. (2021), filtration through an electronegative membrane Barril et al. (2021); Jmii et al. (2021), centrifugal ultrafiltration Anderson-Coughlin et al. (2021), ultracentrifugation Zheng et al. (2022), polyethylene glycol (PEG) precipitation Alexander et al. (2020); Farkas et al. (2021), skim milk flocculation Pino et al. (2021); Philo et al. (2021), and aluminum flocculation Pino et al. (2021); Salvo et al. (2021).

Virus Detection and Quantification. With proper pre-processing and concentration, the wastewater sample is ready to be tested for SARS-CoV-2 RNA detection and quantification. The most common approach is polymerase chain reaction (PCR)-based quantification Ni et al. (2021), whose key step is to quantify the targeted genetic materials (i.e., SARS-CoV-2 N1, N2, and E genes Lu et al. (2020); Corman et al. (2020)). The PCR test uses special chemicals and enzymes to amplify the targeted genetic materials in cycles. Once the target genes are amplified, they become detectable by lab methods and can be further interpreted to obtain the viral concentration in the sample. In practice, different amplification procedures are used for SARS-CoV-2 RNA quantification, including RT-LAMP (reverse transcription loop-mediated isothermal amplification) Amoah et al. (2021), RT-qPCR (reverse transcription-quantitative polymerase chain reaction) Ahmed et al. (2020a), variations of RT-qPCR La Rosa et al. (2020); Navarro et al. (2021), and RT-ddPCR (RT-droplet digital PCR) Flood et al. (2021). In addition to the viral concentration, the number of amplification cycles needed to detect the target genes (i.e., the $C_t$ value) can also be used as a criterion to quantify the viral load. Specifically, the lower the $C_t$ value, the greater the amount of viral RNA present in the original sample, and vice versa.

Lab Control Methods. The amount of SARS-CoV-2 virus in the wastewater sample is subject to loss during the sample pre-processing and testing steps. The lost amount may vary by sample quality and testing method. A frequently used way to assess the loss during the process is a matrix recovery control, which is a virus that is biologically similar to SARS-CoV-2. Commonly used control viruses include murine coronavirus (also called murine hepatitis virus), bacteriophage phi6, Pepper Mild Mottle virus (PMMoV), bovine coronavirus, bovine respiratory syncytial virus, and human coronavirus OC43 Ahmed et al. (2020b); Torii et al. (2022); Hata et al. (2020); LaTurner et al. (2021); Nagarkar et al. (2022). Specifically, the matrix recovery control is spiked into the wastewater sample at a known concentration prior to the pre-processing step, and the concentration of the control virus is measured again after the testing step. The ratio of the control concentration measured after testing to the concentration spiked before pre-processing can be used to estimate the recovery rate of the SARS-CoV-2 virus during the entire procedure.

Normalization. To enable the comparison of viral concentrations across locations and over time, the raw concentrations often need to be normalized by the daily wastewater

Data Analytics for Wastewater-based COVID-19 Surveillance
In this section, we review the current literature on wastewater data analytic methods from four perspectives: viral shedding studies, correlation analysis, estimation models, and uncertainty analysis. For the estimation models, we divide the current methods into model-driven methods and data-driven methods. The organization of this section is illustrated in Figure 2.

Viral Shedding Studies
The existing viral shedding studies focus on quantifying the amount of SARS-CoV-2 virus in different types of human waste from infected individuals and the shedding duration of the virus.

Shedding Amount. Gupta et al. (2020) reviewed the literature describing COVID-19 patients tested for fecal virus. The review shows that only 53.9% of the infected individuals tested for fecal RNA were positive. A more detailed study was conducted in Jones et al. (2020), which suggests that SARS-CoV-2 RNA can be detected not only in feces but also occasionally in urine. The likelihood of SARS-CoV-2 being transmitted via feces or urine appears low, owing to the relatively low amounts of virus present in feces and urine. Consequently, the likelihood of infection due to contact with sewage-contaminated water (e.g., swimming, surfing, angling) or food (e.g., salads, shellfish) is extremely low or negligible, given the very low abundances and limited environmental survival of SARS-CoV-2. Similar findings were reported in Wölfel et al. (2020), where a virological assessment of hospitalized patients with COVID-19 was conducted. Their study indicates that infectious SARS-CoV-2 virus was isolated exclusively from throat- or lung-derived samples, but never from blood, urine, or stool samples.
To calibrate the shedding rate of infected individuals, Schmitz et al. (2021) studied WBE for SARS-CoV-2 by enumerating the asymptomatic COVID-19 cases on a university campus. The study found that 79.2% of SARS-CoV-2 infections were asymptomatic and only 20.8% were symptomatic. To calculate the shedding rate, positive cases detected on the day before, the day of, and the four days after sampling were included in the count of infected individuals contributing to viral shedding. The results showed that the mean fecal shedding rate based on the N1 gene was 7.30 ± 0.67 log10 gc/g-feces (log gene copies per gram of feces).
In addition to the general shedding study on infected individuals, a later study explored the association between patient age and viral shedding amount based on data from two wastewater sites in Massachusetts Omori et al. (2021). Specifically, the viral load in wastewater was modeled as a combination of viruses contributed by different age groups. By incorporating the case count delay, the wastewater viral load was fitted with the daily case counts of the different age groups. The results indicate that the virus contribution rate of patients from the 80+ yr age group can be 1.5 times larger than the corresponding rate of patients from the 0-19 yr age group.

Shedding Duration. Gupta et al. (2020) suggest that the duration of fecal viral shedding mostly ranges from 1 to 33 days after a negative nasopharyngeal swab. Similar findings were also reported in Wu et al. (2020). Moreover, Wölfel et al. (2020) reveal that fecal virus shedding peaks in the symptomatic period and declines in the post-symptomatic phase. Miura et al. (2021) modeled the viral shedding kinetics with the collected data under the Bayesian framework. In particular, the duration of viral shedding and the concentration of virus copies in feces over time were jointly estimated. The results showed that the median concentration of SARS-CoV-2 in feces was 3.4 (95% credible interval (CrI): 0.24-6.5) log gc/g-feces over the entire shedding period, and the duration of viral shedding was 26.0 days (95% CrI: 21.7-34.9) from the symptom onset date.

Correlation Analysis
The correlations between the wastewater viral level and clinical data (e.g., cases, hospitalizations, deaths) have been extensively studied in the current literature. Table 1 summarizes the correlation studies by their study location, sampling information (i.e., sampling site, method, frequency, and sampling period), and correlation details (i.e., correlation types, correlation variables, correlation strength, and time lag between the two variables). Specifically, the 'Var. 2 lag' column represents the lag of the clinical data (i.e., Var. 2) relative to the viral levels (i.e., Var. 1). Therefore, a negative lag means that the clinical data lead the wastewater viral data, and a positive lag means that the clinical data lag behind the viral data.

Correlation Metrics. To evaluate the correlation between viral data and clinical data, commonly used metrics include the Pearson correlation, $R^2$ for the linear regression model, Spearman's rank correlation, and Kendall's $\tau$ correlation. Assume that the time series for wastewater viral data and clinical data are $X = \{x_1, x_2, \ldots, x_n\}$ and $Y = \{y_1, y_2, \ldots, y_n\}$, where the data pairs $(x_i, y_i)$ are aligned at timestamp $i$. The correlations between the two time series under the different metrics are defined as follows.

Pearson correlation: the Pearson correlation $r_{XY}$ between time series $X$ and $Y$ is defined as

$$r_{XY} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \tag{1}$$

where $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$ are the means of the two series. The correlation score has a value between -1 and 1, which reflects the linear correlation of the variables. One practical problem of the Pearson correlation is its inability to handle noisy data. Specifically, when $X$ and $Y$ are both noisy, the correlation of the two series is not guaranteed to be low, which may lead to false positive correlation results.

$R^2$ for linear regression model: assume that the clinical data $Y$ can be fitted by the wastewater viral data $X$ with a linear regression model (i.e., $\hat{y}_i = \alpha + \beta x_i$); then the coefficient of determination $R^2$ can be calculated as

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \tag{2}$$

The $R^2$ for the linear regression model can be used as a complementary metric for the Pearson correlation, as it can help rule out spurious correlation between noisy data.

Spearman's rank correlation: Spearman's rank correlation is used to evaluate the rank consistency between two data series. To calculate the correlation between $X$ and $Y$, the two time series need to be converted into series of ranks $R_X$ and $R_Y$. The correlation coefficient is then calculated as the Pearson correlation between $R_X$ and $R_Y$. The advantage of Spearman's correlation is that $X$ and $Y$ can be related by any monotonic function rather than only by the linear relationship assumed in the Pearson correlation.

Kendall's $\tau$ correlation: Kendall's $\tau$ correlation is defined by the concordance of data pairs. Specifically, for any two data pairs $(x_i, y_i)$ and $(x_j, y_j)$ with $i < j$, the two pairs are considered concordant if the sort order of $(x_i, x_j)$ and $(y_i, y_j)$ agrees, and discordant otherwise. Based on that, the correlation can be calculated as

$$\tau = \frac{(\text{number of concordant pairs}) - (\text{number of discordant pairs})}{n(n-1)/2} \tag{3}$$

Kendall's $\tau$ correlation is similar to Spearman's rank correlation but is generally preferred when the sample size is small and when there are many tied values in the time series.

Moreover, extensive correlation studies suggest that the correlation between viral concentration and new cases (either daily new or weekly new cases) is stronger than that with active cases and cumulative cases. Also, as the shedding duration of the SARS-CoV-2 virus can be as long as several weeks, the correlations between wastewater viral data and reported cases are often stronger in the pre-peak phase than in the post-peak phase Róka et al. (2021). Normalizing viral data with fecal indicators can also improve the analysis Róka et al. (2021).

[...] (2022). The estimated prevalence was found to be significantly higher than the reported clinical cases in the area due to asymptomatic cases and unreported cases. To thoroughly understand the gap between disease prevalence and reported cases, Layton et al. (2022) performed randomized door-to-door nasal swab sampling events in different Oregon communities to infer the community COVID-19 prevalence. The estimated prevalence data were then compared with the reported positive cases and the wastewater concentration in the community. Statistical results show that the wastewater viral concentrations were more highly correlated with the estimated community prevalence than with clinically reported cases. Similar results were also observed in Claro et

Model-driven Methods
Shedding Model-based Methods. The key idea of shedding model-based estimation methods is to directly use the viral concentration/load and human shedding profiles to estimate the total infected population. The pioneering work is Ahmed et al. (2020a, 2021), which studied WBE for SARS-CoV-2 in Australia. In particular, the prevalence of SARS-CoV-2 in the sewershed was estimated using the following formula: [...] where the shedding quantities are the number of grams of feces contributed by each individual infected on a given day, the log10 maximum RNA copies per gram of feces being shed, and the log10 RNA copies per gram of feces being shed 25 days after being infected. To further account for viral decay in the sewage system, a holding time- and system temperature-dependent decay model is applied to approximate the viral loss in the collected samples. The proposed framework was fitted to the wastewater surveillance data in South Carolina from May 2020 to August 2020. The model prediction reveals that the number of unreported COVID-19 cases was approximately 11 times that of confirmed cases, which aligns well with the independent estimation.

Directly inferring the SEIR model from the viral load in wastewater may yield unstable results due to noisy viral fluctuations. To address this issue, several statistical models have been explored to reconstruct the epidemic model. Fazli et al. (2021) proposed to utilize the partially observed Markov processes model (POMP King et al. (2016)) to infer the populations in the $S$, $E$, $I$, and $R$ compartments from the observed viral load and reported cases. Depending on the usage of observed data, three variants were derived from the framework: "SEIR-VY", "SEIR-V", and "SEIR-Y". Specifically, model "SEIR-VY" uses both viral load and case counts to fit the parameters, whereas models "SEIR-Y" and "SEIR-V" utilize only case counts and only viral load, respectively. The evaluation results demonstrated that a simple SEIR model based on viral load data can reliably predict the number of infections in the near future. Another direction of study was to use the extended Kalman filter (EKF Kalman (1960)) to reconstruct the SEIR model Proverbio et al. (2022). The proposed framework was used to infer shedding populations, the effective reproduction number, and future epidemic projections. The framework was tested on wastewater data from different regions. The results showed that the inferred case numbers are well correlated with the true detected case numbers, with correlation coefficients ranging between 0.7 and 0.9. The study also validated that frequent sampling improves the model calibration and the subsequent reconstruction performance.
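To make the link between an SEIR model and wastewater measurements concrete, the following minimal sketch simulates a discrete-time SEIR model and maps the infectious compartment to an expected wastewater viral load. It is an illustration under stated assumptions, not the method of any cited study: the parameter names and values (`beta`, `sigma`, `gamma`, `shedding_per_case`) are hypothetical, and the assumption that viral load is simply proportional to the infectious population is much cruder than the POMP or EKF observation models described above.

```python
def simulate_seir_viral_load(population, initial_infectious, beta, sigma, gamma,
                             shedding_per_case, days):
    """Forward-Euler SEIR simulation with one-day steps.

    Returns a list of (S, E, I, R, viral_load) tuples, one per simulated day.
    """
    s = float(population - initial_infectious)
    e, i, r = 0.0, float(initial_infectious), 0.0
    trajectory = []
    for _ in range(days):
        # Assumption: expected viral load is proportional to the infectious population.
        trajectory.append((s, e, i, r, shedding_per_case * i))
        new_exposed = beta * s * i / population   # S -> E
        new_infectious = sigma * e                # E -> I
        new_recovered = gamma * i                 # I -> R
        s -= new_exposed
        e += new_exposed - new_infectious
        i += new_infectious - new_recovered
        r += new_recovered
    return trajectory


# Hypothetical parameters chosen only for illustration.
traj = simulate_seir_viral_load(
    population=100_000, initial_infectious=10,
    beta=0.4, sigma=0.2, gamma=0.1,
    shedding_per_case=1e9, days=120,
)
peak_day = max(range(len(traj)), key=lambda d: traj[d][4])
print(f"simulated viral load peaks on day {peak_day}")
```

In an inference setting, the simulated load would be compared with observed wastewater measurements to calibrate the transmission and shedding parameters; that fitting step is deliberately omitted here.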
The limitation of the previously mentioned SEIR-based frameworks is that they assume all infected individuals follow the same shedding model. In reality, the shedding models of asymptomatic infections and hospitalized infections may differ significantly from each other. To address this issue, Nourbakhsh et al. (2022) presented an extended SEIR model, as illustrated in the right panel of Figure 3. Specifically, the infected individuals are further categorized into four subgroups: infectious (I), infectious and later admitted to hospital (J), asymptomatic infectious (A), and hospitalized (H). Furthermore, considering that some recovered cases may still shed virus through feces, the recovered group is also divided into two subgroups: non-infectious but still shedding virus (Z) and recovered (R). The model was fitted to clinical data (both hospitalizations and confirmed cases) from three Canadian cities and provided good estimates of actual prevalence, effective reproduction number, and future incidence. In addition, the model was used to perform exploratory simulations that quantify the effect of surveillance effectiveness, public health interventions, and vaccination on the discordance between clinical and wastewater data.
The aforementioned frameworks are predominantly based on single-strain epidemic analysis, which cannot effectively deal with the spread dynamics of multiple strains. Pell et al. (2023) presented a four-dimensional modified SIR model to study disease dynamics when two strains are circulating in the population. The model was applied to understand the emergence of the SARS-CoV-2 Delta variant in the presence of the Alpha variant using wastewater data from Massachusetts. In the model, a time delay is incorporated to account for temporary cross-immunity induced by previous infection with the established (or dominant) strain. The study finds that the time delay does not influence the stability of the equilibrium and is hence a harmless delay. However, the equilibrium is governed by the basic reproduction numbers of the two strains in nontrivial ways due to the inclusion of cross-immunity.

Data-driven Methods
Time Series-based Methods. In exploiting the predictive power of wastewater data from a data-driven perspective, some time series-based methods have demonstrated their effectiveness in short-term forecast tasks. Karthikeyan et al. (2021) experimented with the multivariate autoregressive integrated moving average (ARIMA) model to predict the number of new positive cases from the historical case data, wastewater data, and sample collection date in San Diego from July to October 2020. Specifically, the model was used for 1-week to 3-week advance case predictions. To evaluate the model, the Pearson correlation $r$ between the observed cases and predicted cases and the Root Mean Squared Error (RMSE) of the predicted cases were calculated. For the 1-, 2-, and 3-week advance forecast tasks, the results were $r$ = 0.79, 0.69, and 0.47 and RMSE = 50, 59, and 70, respectively.
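The two evaluation metrics mentioned above, the Pearson correlation and the RMSE between observed and predicted cases, can be computed as in the following sketch. The function names and the case counts are hypothetical and serve only to illustrate the calculation; they are not data from the cited study.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def rmse(observed, predicted):
    """Root mean squared error of the predictions."""
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical weekly case counts and forecasts (illustrative only).
observed_cases = [100, 120, 150, 170]
predicted_cases = [110, 115, 140, 180]
print(f"r = {pearson_r(observed_cases, predicted_cases):.3f}")
print(f"RMSE = {rmse(observed_cases, predicted_cases):.2f}")
```

The two metrics are complementary: the correlation captures whether the forecast follows the trend, while the RMSE captures the size of the errors in units of cases.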
In Cao and Francis (2021), a vector autoregression (VAR) model was utilized to predict new cases from historical cases and viral concentration in Indiana (PA) from April 2020 to February 2021. The Mean Absolute Percentage Error (MAPE) values for the 1-3 week case predictions were 11.85%, 8.97%, and 21.57%, respectively. The study suggests that short time series can reliably predict cases 1 week ahead but are not adequate for predicting cases 3 weeks ahead. To improve the robustness of long-term prediction tasks, a longer time series is needed. Moreover, the paper also studied whether different representations of viral data would affect the prediction results. Their study shows that the log-scaled representation of viral concentration provides the best interpretability of the data, while the original viral concentration has stronger forecasting ability under the VAR model framework.
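For reference, the percentage error metric used to score such forecasts can be sketched as follows. The function name and the example numbers are hypothetical, and the sketch assumes that all observed values are non-zero, since the metric is undefined otherwise.

```python
def mape(observed, predicted):
    """Mean absolute percentage error, in percent.

    Assumes every observed value is non-zero.
    """
    if any(o == 0 for o in observed):
        raise ValueError("MAPE is undefined when an observed value is zero")
    ratios = [abs((o - p) / o) for o, p in zip(observed, predicted)]
    return 100.0 * sum(ratios) / len(ratios)


# Hypothetical weekly case counts and forecasts (illustrative only).
print(f"MAPE = {mape([100, 200], [110, 180]):.2f}%")
```

Unlike the RMSE, this metric is scale-free, which makes forecasts comparable across regions with very different case counts.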
The ARIMA model and the VAR model were systematically compared in a wastewater surveillance study in Detroit from September 2020 to August 2021 Zhao et al. (2022). The study showed that the autoregression model with seasonal patterns (SARIMA) and the VAR model are more effective in predicting COVID-19 incidence than the ARIMA model. Specifically, the correlation between VAR-predicted cases and observed cases is around 0.95 to 0.96 for the 1-week advance forecast task. Similarly, the correlation for the SARIMA model is around 0.94 to 0.95. For the ARIMA model, however, the correlation is only around 0.4 to 0.67.
Another line of time series-based methods is derived from the spatiotemporal methods, which take both spatial information of sewersheds and temporal information of viral load into account in the estimation model.In Li et al. (2023a), Li et al. proposed a spatially continuous statistical model that quantifies the relationship between viral concentration and a collection of covariates including socio-demographics, land cover and virus-associated genomic characteristics at the sewersheds while accounting for spatial and temporal correlation.The model is used to predict the weekly viral concentration at the population-weighted centroid of the 32,844 Lower Super Output Areas (LSOAs) in England, then aggregate these LSOA predictions to the Lower Tier Local Authority level (LTLA).In addition, the model is also used to quantify the probability of change directions (decrease or increase) in viral concentration over short periods.Non-time Series-based Methods.A wide range of regression models have been applied to the wastewater data for case prediction due to the ease of implementation and explanability.The simplest regression model assumes that the number of cumulative cases at time  +  is linearly related to the viral concentration at time , and has demonstrated its effectiveness for short-term case prediction Joseph-Duran et al. (2022).In Li et al. (2023b), Li et al. applied the random forest model to predict COVID-19-induced weekly new hospitalizations in 159 counties across 45 states in the United States of America (USA).In particular, different models were established to predict three different hospitalization indicators: weekly new hospitalizations, census inpatient sum, census inpatient average.For each hospitalization indicator, a variety of features, such as Community Vulnerability Index Smittenaar et al. 
(2021), vaccination coverage, population size, weather, viral concentration, and wastewater temperature, were fed into the model. The study showed that the model can accurately predict county-level weekly new admissions, allowing a preparation window of 1-4 weeks. It also suggests updating the trained model periodically to ensure accuracy and transferability, with a mean absolute error within 4-6 patients/100k population for upcoming weekly new hospitalization numbers. In Aberi et al. (2021) (2022). In addition to the wastewater data, some relevant atmospheric variables (e.g., rainfall, humidity, temperature) are also considered in the models. The results showed that the LOESS model yields the least prediction error with $R^2 = 0.88$. The $R^2$ values for the linear and GAM models are 0.85 and 0.87, respectively. By changing the prediction period, the study found that the reliability of the model predictions can change over time due to causes such as the change of SARS-CoV-2 variants. In Anneser et al. (2022), the linear and GAM models were compared with the Poisson and Negative Binomial models to predict cases from the wastewater data in the three New England regions. The models that fit the data best were the linear, GAM, and Poisson models, with very small differences in $R^2$ and a second fit statistic. The same set of models was tested on the wastewater data in Oklahoma City from November 2020 to March 2021, with some sociodemographic factors (e.g., age, race, and income) considered in the models Kuhn et al. (2022). The best results were obtained using a multivariate Poisson model. Consistent with the finding in Vallejo et al. (2022), the performance of the Poisson model varies with the study period. Specifically, its accuracy decayed from 92% between November 2020 and the end of January 2021 to 59% during February and March 2021. In Morvan et al.
(2022), the shedding model in (4) and gradient boosted regression trees (GBRT) were combined to estimate the COVID-19 prevalence in England with the wastewater data from 45 sewage sites. The estimated prevalence was within 1.1% of the estimates from a representative prevalence survey Morvan et al. (2022). In Xiao et al. (2022), the changing dynamics between the reported cases and the wastewater viral load were explicitly studied. Specifically, the clinically reported cases were modeled as the convolution of the scaled wastewater data with an unknown transfer function. It was hypothesized that the transfer function could be fit by a beta distribution. The model was fitted to the wastewater surveillance data in the Boston area from March 2020 to May 2021. The results showed that the transfer function has a broad peak and a long tail before mid-August 2020, indicating that the process of infected individuals getting counted as cases has a broad distribution, with some individuals being reported very quickly and others taking up to weeks. In this case, the wastewater viral load can be used as an early indicator of disease dynamics before clinical test results come back positive. After mid-August, the transfer function becomes more sharply peaked, indicating that wastewater and reported cases track each other closely. Consequently, the wastewater viral load has less utility as an early warning signal, as increased clinical testing capacity effectively captures new infections in a timely manner.
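The convolution model described above can be illustrated with a short sketch. It is not the fitting procedure of Xiao et al. (2022): the kernel here is simply a beta-shaped curve discretized over a fixed delay window, and the window length, shape parameters, scale, and function names are all hypothetical choices made for illustration.

```python
def beta_kernel(alpha, beta, n_days):
    """Discretize an unnormalized Beta(alpha, beta) density over n_days delay
    bins (evaluated at bin midpoints) and normalize the weights to sum to 1."""
    xs = [(d + 0.5) / n_days for d in range(n_days)]
    weights = [x ** (alpha - 1) * (1 - x) ** (beta - 1) for x in xs]
    total = sum(weights)
    return [w / total for w in weights]


def predicted_cases(wastewater, kernel, scale):
    """Causal convolution: cases[t] = scale * sum_d kernel[d] * wastewater[t - d]."""
    out = []
    for t in range(len(wastewater)):
        acc = 0.0
        for d, k in enumerate(kernel):
            if t - d >= 0:
                acc += k * wastewater[t - d]
        out.append(scale * acc)
    return out


# A broad kernel spreads reporting over many days; a sharp one reports quickly.
broad = beta_kernel(2.0, 5.0, 21)
sharp = beta_kernel(2.0, 30.0, 21)

# Hypothetical daily wastewater signal with a single pulse on day 0.
signal = [1.0] + [0.0] * 20
print("peak delay (broad):", max(range(21), key=lambda d: predicted_cases(signal, broad, 1.0)[d]))
print("peak delay (sharp):", max(range(21), key=lambda d: predicted_cases(signal, sharp, 1.0)[d]))
```

A sharply peaked kernel leaves little delay between the wastewater signal and reported cases, which matches the reduced early-warning value noted above for the later period.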
In addition to the aforementioned simple regression models, some deep learning-based models have also been explored for wastewater-based epidemic surveillance tasks Zhu et al. (2022); Jiang et al. (2022); Li et al. (2021a); Galani et al. (2022). Specifically, the artificial neural network (ANN) model and the adaptive neuro-fuzzy inference system (ANFIS) have proven effective in different studies for case prediction tasks when compared with linear models and random forest Li et al. (2021a). By incorporating the catchment information, weather, clinical testing coverage, and vaccination rate features into the ANN model, the effective reproduction rate can be estimated, as studied in Jiang et al. (2022).
Aside from the effectiveness of the learning models, the features fed into them may also have an impact on the prediction results. In Li et al. (2021a), the study indicated that the air and wastewater temperature played a critical role in prevalence estimation by data-driven models. Normalizing and smoothing the wastewater data Aberi et al. (2021) or transforming the viral load into log scale Vallejo et al. (2022) can also help in fitting the models. To better understand the spread of the disease and the effect of the public health response, Xiao et al. proposed to monitor, in addition to the viral load alone, the ratio between wastewater viral load and clinical cases (WC ratio) and the time lag between wastewater and clinical reporting Xiao et al. (2022). Specifically, a high WC ratio implies that the existing testing capacity has not kept pace with exponentially rising new cases, which are nevertheless detected in wastewater surveillance. Conversely, a low WC ratio indicates that clinical tests are capturing the majority of infections reflected in the wastewater viral load. When this ratio is stable and low, it implies that the existing testing capacity is sufficient to assess the extent of new infections. The time lag, on the other hand, may reflect the accessibility of test facilities. In Kuhn et al. (2022), Kuhn et al. showed that the lag was significantly lower for areas with a higher household income and a higher proportion of the population aged 65 or older, but higher for areas with a high proportion of Hispanic inhabitants.
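The preprocessing steps and the WC ratio mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions: the trailing moving average, the window length, the handling of zero-case days, and the sample values are hypothetical choices, not the procedures used in the cited studies.

```python
from math import log10


def moving_average(values, window):
    """Trailing moving average; early points average over the values available."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def wc_ratio(viral_load, cases):
    """Ratio of wastewater viral load to reported cases at each time step.
    Returns None where no cases were reported (ratio undefined)."""
    return [v / c if c > 0 else None for v, c in zip(viral_load, cases)]


# Hypothetical daily viral load (gene copies/L) and reported cases.
viral_load = [1.2e4, 1.5e4, 2.4e4, 3.9e4, 5.1e4, 4.7e4, 4.0e4]
cases = [10, 12, 15, 20, 31, 38, 40]

smoothed = moving_average(viral_load, window=3)
log_load = [log10(v) for v in smoothed]  # log-scale feature for model fitting
ratios = wc_ratio(smoothed, cases)

print([round(x, 2) for x in log_load])
print([round(r) for r in ratios])
```

A falling ratio over time, as in this toy series, would be read as clinical testing catching up with the infections reflected in wastewater.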


Uncertainty Analysis
The accuracy of wastewater-based COVID-19 surveillance is limited by the uncertainty and inevitable viral loss introduced at each process step. Recall that the key steps of WBE systems are virus shedding, in-sewer transportation, sampling, testing, and data analytics, as shown in Figure 1. The uncertainties associated with those steps are illustrated in Figure 4.
In Li et al. (2021b), Li et al. systematically studied the uncertainty in estimating SARS-CoV-2 prevalence by WBE. The study suggested that the uncertainty caused by the excretion rate becomes limited for prevalence estimation when the number of infected persons in the catchment area exceeds 10. As for the sampling methods, grab sampling contributed the highest uncertainty (around 30% on average), while a continuous flow-proportional sampling method showed <10%. The uncertainty introduced at the testing stage was the dominant factor. Therefore, it is important to use surrogate viruses as internal or external standards during the virus test process. Overall, WBE can be considered a reliable complementary surveillance strategy for SARS-CoV-2 with reasonable uncertainty (20-40%).
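To make the idea of step-wise uncertainty concrete, the sketch below propagates per-step relative uncertainties through a simple mass-balance estimate (infected persons = concentration × flow / per-person shedding rate) by Monte Carlo sampling. This is an illustrative assumption, not the analysis of Li et al. (2021b): the independent Gaussian multiplicative errors, the relative standard deviations, and all input values are hypothetical.

```python
import random
import statistics


def simulate_infected(conc_gc_per_l, flow_l_per_day, shed_gc_per_person_day,
                      rel_sd, n_draws=10000, seed=0):
    """Draw estimates of the number of infected persons, perturbing each step
    with an independent multiplicative Gaussian error (assumed, for illustration)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        # Factors are clipped at a tiny positive value to keep estimates finite.
        f_test = max(rng.gauss(1.0, rel_sd["testing"]), 1e-9)
        f_samp = max(rng.gauss(1.0, rel_sd["sampling"]), 1e-9)
        f_shed = max(rng.gauss(1.0, rel_sd["shedding"]), 1e-9)
        load = conc_gc_per_l * f_test * f_samp * flow_l_per_day
        draws.append(load / (shed_gc_per_person_day * f_shed))
    return draws


# Hypothetical inputs: 2e4 gc/L, 5e7 L/day, 1e9 gc/person/day (nominal 1000 infected).
draws = simulate_infected(
    2e4, 5e7, 1e9,
    rel_sd={"testing": 0.20, "sampling": 0.10, "shedding": 0.10},
)
mean_est = statistics.fmean(draws)
rel_unc = statistics.stdev(draws) / mean_est
print("mean estimate:", round(mean_est))
print("overall relative uncertainty:", round(rel_unc, 2))
```

Raising the sampling term to mimic grab sampling, or lowering the testing term to mimic the use of surrogate standards, shows how each step shifts the overall figure.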

Datasets
This section summarizes the publicly available global wastewater datasets in Table 2. For each dataset, the country, data granularity, area covered, time granularity, time span, current status, and corresponding website are listed. The data granularity represents the aggregation level of wastewater signals, which can range from building level to country level. The 'area covered' column shows the monitoring area of the dataset. Time granularity indicates the sampling frequency of the wastewater data. Specifically, for the datasets labeled with '>1/week', more than one data point was observed per week overall, but the actual number of weekly samples may vary over time. The time span specifies the sampling period of the dataset, while the 'live' column indicates whether the data on the website is still being updated. Lastly, the 'website' column gives the link to the dataset.

Challenges and Future Directions
Wastewater-based epidemiology has been used as an effective tool to complement conventional clinical testing methods for COVID-19 surveillance. Although substantial efforts have been made in the area, many challenges remain to be addressed in future research. Three important problems worth exploring are identified as follows.
Shedding Variability. Current studies predominantly assume that infected individuals follow a uniform shedding model, with only a few works accounting for the variability of the shedding profile. In fact, the shedding amount and duration of SARS-CoV-2 in feces can vary widely between individuals and over time. Factors such as the stage of infection, disease severity, vaccination status, and individual health condition may all affect the shedding profile. As the shedding model is often used directly, together with the total viral load in the wastewater, to estimate the disease prevalence, it is crucial to construct customized shedding profiles for different infected individuals.
Sample Testing and Virus Quantification. Long-term wastewater-based COVID-19 surveillance is an economical way to detect disease outbreaks and emerging variants Karthikeyan et al. (2022); Lamba et al.
(2022). One critical problem for surveillance systems is the allocation of test resources. To be specific, given a limited budget for sample testing, it is important to choose the sampling locations and frequency by considering the catchment size and the population served in the area, so that potential outbreaks can be detected as early as possible. On the other hand, the virus quantified in the wastewater sample may not reflect the actual amount of virus entering the sewage system because of the limited sensitivity of lab methods and viral decay in the sewage system. Therefore, it is essential to improve the lab testing methods and understand the virus decay model under various environmental parameters (e.g., temperature, wastewater pH, etc.).
Data Analytics. The majority of the existing literature treats the wastewater data from each site as a standalone signal for epidemic analysis, while little effort has been made to study the wastewater data from multiple sites collectively for spatial-temporal pattern analysis. Compared to the standalone analysis, the spatial-temporal analysis is more useful for reconstructing the epidemic process at a panoramic scale. In particular, in large metropolitan areas that are divided into multiple sewersheds, local residents may contribute to different sites as they commute from residential areas to commercial areas. It is therefore hard to recover the disease spread process without tackling the interdependency across sites. The main obstacle to this research direction is the comparability of the data from different sites. Specifically, the sample collection methods, testing methods, and sewage system structure may vary by site. Correspondingly, the same viral load from different sewersheds may represent different epidemic conditions in reality. Compiling those data into a uniform framework is thus a challenging task. It is also critical to integrate the mobility patterns of local residents into the analytical model so that the dependency across sites can be effectively unraveled.
Moreover, the uncertainties introduced in the wastewater analytic pipeline are not negligible, as explained in Section 4.4. Therefore, it is important to quantify the uncertainty together with the prediction results to ensure the reliability of the results.
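As one small illustration of the cross-site comparability problem raised above, a common first step is to convert raw concentrations into a flow- and population-normalized load before comparing sewersheds. The sketch below assumes this normalization; it does not address differences in sampling or testing methods, and the site values and function name are hypothetical.

```python
def per_capita_load(conc_gc_per_l, flow_l_per_day, population):
    """Viral load per person per day: concentration x daily flow / population served."""
    return conc_gc_per_l * flow_l_per_day / population


# Two hypothetical sewersheds with the same raw concentration.
sites = {
    "site_A": {"conc": 1.0e4, "flow": 1.0e7, "pop": 1.0e5},
    "site_B": {"conc": 1.0e4, "flow": 4.0e7, "pop": 1.0e5},
}

normalized = {
    name: per_capita_load(s["conc"], s["flow"], s["pop"]) for name, s in sites.items()
}
for name, load in normalized.items():
    print(name, f"{load:.2e} gc/person/day")
```

Here the two sites report identical concentrations, yet the second carries four times the per-capita load because its flow is four times larger.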

Conclusion
Wastewater-based epidemiology has been demonstrated to be a powerful tool for COVID-19 surveillance and trend projection within communities. This survey summarizes the wastewater sampling techniques, sample testing methods, data analytical models, and the existing wastewater datasets at a global level. In particular, this survey provides a new taxonomy of data analytical models to help researchers and practitioners form a systematic view of the area. Most importantly, the reviewed data analytical models can be readily generalized to many other infectious diseases and can serve as guidance for building general disease surveillance systems. Moreover, the comprehensive wastewater datasets at different granularities can serve as benchmarks for validating new surveillance models at various scales. Last but not least, the challenges in the area are discussed, which may help inspire researchers in their future research directions.
Acknowledgements. We thank members of the Biocomplexity COVID-19 Response Team and the Network Systems Science and Advanced Computing (NSSAC) Division of the University of Virginia for their thoughtful comments and suggestions related to epidemic modeling and response support. We also thank scientists at the Virginia Department of Health (VDH) and the Division of Consolidated Laboratory Services (DCLS) for their collaboration. This

Figure 1: Overview of Wastewater-based Epidemiology Surveillance System.
Compared with composite samples, grab samples only represent the viral levels at a single moment, which may vary by sampling time, fluctuations in wastewater flow, and study locations. In Gerrity et al. (2021), a wastewater study in

Figure 2: The overview of wastewater data analytics.
); Scott et al. (2021); Tandukar et al. (2022); Nagarkar et al. (2022); D'Aoust et al. (2021a,b); Perez-Zabaleta et al. (2023); Mohapatra et al. (2023). In addition to the aforementioned factors, the availability of home test kits has significantly affected the correlation between wastewater viral data and clinical data. Varkila et al. (2023) analyzed the time series of 268 counties in 22 states from January to September 2022. The study showed that SARS-CoV-2 wastewater metrics accurately reflected high clinical rates of disease in early 2022, but this association declined over time as home testing increased.
Aside from the correlation strength, many existing correlation studies also investigated the lag time between clinical data and wastewater viral load. The lag time may vary significantly by time, location, and catchment population due to the varying accessibility of testing resources and the epidemiological conditions of the population Zhao et al. (2022); Kuhn et al. (2022); Bertels et al. (2023); Acosta et al. (2022); López-Peñalver et al. (2023); Belmonte-Lopes et al. (2023). In most cases, the viral load in wastewater is a leading indicator for clinical data, with a lead time ranging from 1 day to 2 weeks during peak times, owing to the delay between infection and test confirmation and to asymptomatic infections Yanaç et al. (2022); Lemaitre et al. (2020). However, the viral load may become a lagging indicator during the declining phase of infection due to the prolonged viral shedding duration Gerrity et al. (2021). The lag time may also follow different distributions for different clinical data types. In general, the lag for positive tests is shorter than that for hospital admissions, which in turn is shorter than that for deaths Peccia et al. (2020); D'Aoust et al. (2021a); Krivoňáková et al. (2021).
Correlation with Estimated Prevalence. In some areas, the wastewater viral data was used to estimate the disease prevalence in the sewersheds by leveraging the personal shedding rate and Monte Carlo simulations Wang et al. (2021); de Freitas Bueno et al. ( al. (2021); Pillay et al. (2021); González-Reyes et al. (2021); de Sousa et al. (2022); Saththasivam et al. (2021).
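The lag time between wastewater viral load and clinical data discussed above is often estimated by shifting one series against the other and picking the shift with the highest correlation. The sketch below assumes this lagged Pearson correlation approach; the cited studies may use other estimators, and the series, search range, and function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def best_lead(wastewater, clinical, max_lag):
    """Return (lag, r) maximizing corr(wastewater[t], clinical[t + lag]).
    A positive lag means wastewater leads the clinical series by that many steps."""
    n = len(wastewater)
    best = None
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            x, y = wastewater[: n - lag], clinical[lag:]
        else:
            x, y = wastewater[-lag:], clinical[: n + lag]
        r = pearson(x, y)
        if best is None or r > best[1]:
            best = (lag, r)
    return best


# Hypothetical daily series: clinical cases echo the wastewater curve 3 days later.
wastewater = [2, 3, 5, 9, 16, 27, 40, 52, 58, 55, 46, 35, 25, 17, 11, 7, 5, 3]
clinical = [1, 1, 1] + wastewater[:-3]

lag, r = best_lead(wastewater, clinical, max_lag=5)
print("estimated lead of wastewater over clinical data (days):", lag)
```

A negative result would indicate that wastewater lags the clinical series, as can happen during the declining phase of an outbreak.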

Figure 3: The SEIR model and extended SEIR model used in the current literature.
work was partially supported by University of Virginia Strategic Investment Fund award number SIF160, National Institutes of Health (NIH) Grant 1R01GM109718, NSF Expeditions in Computing Grant CCF-1918656, VDH Contract UVABIO610-GY23, NSF Grant CCF-1908308, and PG-CoE CDC-RFA-CK22-2204. Any opinions, findings, conclusions, or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views of the funding agencies. This journal article was supported by the Office of Advanced Molecular Detection, Centers for Disease Control and Prevention through Cooperative Agreement Number CK22-2204. Its contents are solely the responsibility of the authors and do not necessarily represent the official views of the Centers for Disease Control and Prevention.
Influential Factors for Correlation. The correlation strength between wastewater viral data and clinical data can be affected by many factors. Li et al. (2023c) conducted a systematic review and meta-analysis on the correlation between SARS-CoV-2 RNA concentration and COVID-19 cases.
Existing studies suggest that the correlation coefficients are potentially affected by environmental factors (e.g., temperature, humidity), epidemiological conditions (e.g., vaccination rate, clinical test coverage), WBE sampling design (e.g., sampling method and frequency), and catchment population (e.g., human mobility, demographics of inhabitants) Li et al. (2023c); Jiang et al. (2022); Rasero et al. (2022); Kuhn et al. (2022); Pillay et al. (2021). In particular, larger variations in air temperature and clinical testing coverage, as well as an increase in catchment size, have strong negative impacts on the correlation between viral concentration and COVID-19 cases. The sampling technique has a negligible impact on the correlation, but increasing the sampling frequency can improve it.

Table 1: Summary of Correlation Studies between Viral Data and Clinical Data.
The study showed that simple models like PL and KNN outperform more complex models such as GAM, SVR, and MLP, with slight differences. Similarly, Vallejo et al. applied linear regression, the generalized additive model (GAM), and the locally estimated scatterplot smoothing (LOESS) model for COVID-19 case prediction in Northwest Spain Vallejo et al.